An approximate textual retrieval algorithm for searching sources with high levels of defects is 
presented. It considers splitting the words in a query into two overlapping segments and subsequently 
building composite regular expressions from interlacing subsets of the segments. This procedure 
reduces the probability of missed occurrences due to source defects, yet diminishes the retrieval of 
irrelevant, non-contextual occurrences. 



I. INTRODUCTION 



Errors in electronic texts too often hinder the complete retrieval of the intended search results. Several approximate 
string matching, or fault-tolerant, techniques have been devised to reduce their impact [1-4]. Although the existing 
methods are numerous and address specific retrieval needs, they are all based on word or character insertions, 
deletions, and substitutions, which are performed within a prescribed threshold of string similarity. 

This article presents an algorithm for approximate string matching, suitable for searching sources with high levels of 
defects. Concretely, it is devised to search collections of scientific texts which are often encoded in electronic formats 
that were originally created for printing and screen presentations. Furthermore, earlier texts are recompiled through 
error-prone, optical character recognition techniques. 

Query strings are first split into words. Words are then divided into two possibly overlapping segments, a prefix 
and a suffix. Interlaced subsets are finally picked from the ordered set of segments to form a composite regular 
expression. On the one hand, the composite regular expression notably reduces the probability of missing a hit due to 
uniformly distributed errors in the document. On the other hand, word segmentation and query sectioning might 
unveil hidden words that are perhaps irrelevant to the query context. 

Eliding parts of words or sentences, such as 'telephone' being set to 'phone' or 'zoological gardens' to 'zoo', and 
morphological derivations preserve in many instances the semantics of the context [5]. The proposed algorithm 
partitions words into their morphological constituents. The prefix segments embrace prefix and root; the suffix 
segments, the root and suffix. This partition gives longer segments and therefore reduces the probability of irrelevant 
retrievals. The interlaced query sectioning and the inter-segment gap lengths are interrelated parameters in the 
algorithm. They refer, intuitively, to an attention and resolution window within which the documents are scanned. 

In the end, the extra computational effort that is necessary to reduce the probability of missing a hit pays off when 
additional, related hits are retrieved as well. 



II. APPROXIMATE TEXTUAL RETRIEVAL ALGORITHM 



Let $T$ be a text document constituted by a sequence $t_1 t_2 \ldots$ of words, which, in turn, are sequences of 
characters over an alphabet $\Sigma$. Let the query $Q$ on $T$ for the word pattern $Q = q_1 q_2 \cdots q_n$ be 
defined as the Boolean function 

$$Q[j] = [q_1 \sim t_{j+1}] \wedge [q_2 \sim t_{j+2}] \wedge \cdots \wedge [q_i \sim t_{j+i}] \wedge \cdots \wedge [q_n \sim t_{j+n}]. \qquad (1)$$

The retrieval of pattern $Q$ from $T$ is then the set of positions $j$ for which the query $Q$ is true. 

If the word similarity relationships $[q_i \sim t_{j+i}]$ are set to equalities, $[q_i = t_{j+i}]$, the probability of 
missing one pattern occurrence due to uniformly distributed errors in $T$ is proportional to the length of $Q$. On the 
other hand, the number of occurrences of $Q$ is proportional to the length of $T$, whenever $T$ is a random text. 
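The exact-match form of the query can be illustrated with a minimal sketch. It assumes the text and the query are already tokenized into lists of words, and it reports 0-based positions of the first matched word rather than the offset $j$ used in the text; the function name `exact_query_positions` is hypothetical.

```python
def exact_query_positions(text_words, query_words):
    """Return the 0-based positions where the query words equal consecutive text words."""
    n = len(query_words)
    hits = []
    for j in range(len(text_words) - n + 1):
        # Conjunction of the word equalities [q_i = t_{j+i}] over the whole pattern
        if all(query_words[i] == text_words[j + i] for i in range(n)):
            hits.append(j)
    return hits


text = "an aproximate textual retrieval and an approximate textual retrieval".split()
hits = exact_query_positions(text, ["approximate", "textual", "retrieval"])
print(hits)
```

A single misspelled word in the text, as in the first occurrence above, is enough for the exact query to miss that occurrence.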



A. Approximate composite queries 



Let the words in $Q$ be split into two segments, $q^p$ and $q^s$, such that 

$$q = q^p \cup q^s, \qquad (2)$$
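and let the word similarities $[q \sim t]$ be set to $[q^p \subset t]$ and $[q^s \subset t]$, meaning that $q$ is 
similar to $t$ if $q^p$ or $q^s$ are segments of $t$. Furthermore, let $Q$ be sectioned into $b$ interlaced blocks to 
build component queries. For $b = 2$, the two component queries are 

$$\begin{aligned}
Q' &= [q_1^p \subset t_{j+1}] \wedge [q_2^p \subset t_{j+2}] \wedge \cdots \wedge [q_n^p \subset t_{j+n}] \\
Q'' &= [q_1^s \subset t_{j+1}] \wedge [q_2^s \subset t_{j+2}] \wedge \cdots \wedge [q_n^s \subset t_{j+n}]
\end{aligned} \qquad (3)$$

and the composite query $\mathcal{Q}$ is 

$$\mathcal{Q} = Q' \cup Q''.$$

More generally, with $R$ being the relabeled sequence of the word segments of $Q$, 

$$R = r_1 r_2 \cdots r_m = q_1^p q_1^s q_2^p q_2^s \cdots q_n^p q_n^s,$$

the component queries are 

$$\begin{aligned}
\mathcal{R}^1 &= \lfloor r_1 \rfloor \wedge \lfloor \Sigma_{d_{1,1+b}} \rfloor \wedge \lfloor r_{1+b} \rfloor \wedge \lfloor \Sigma_{d_{1+b,1+2b}} \rfloor \wedge \cdots \wedge \lfloor r_{1+\lceil m/b \rceil b-b} \rfloor \\
\mathcal{R}^2 &= \lfloor r_2 \rfloor \wedge \lfloor \Sigma_{d_{2,2+b}} \rfloor \wedge \lfloor r_{2+b} \rfloor \wedge \lfloor \Sigma_{d_{2+b,2+2b}} \rfloor \wedge \cdots \wedge \lfloor r_{2+\lceil m/b \rceil b-b} \rfloor \\
&\;\;\vdots \\
\mathcal{R}^b &= \lfloor r_b \rfloor \wedge \lfloor \Sigma_{d_{b,2b}} \rfloor \wedge \lfloor r_{2b} \rfloor \wedge \lfloor \Sigma_{d_{2b,3b}} \rfloor \wedge \cdots \wedge \lfloor r_{\lfloor m/b \rfloor b} \rfloor.
\end{aligned} \qquad (4)$$

The approximate composite query $\mathcal{R}$ derived from $Q$ is then the union 

$$\mathcal{R} = \bigcup_{k=1}^{b} \mathcal{R}^k. \qquad (5)$$

In fact, $\mathcal{R}$ is an alternated regular expression. The notation $\lfloor \cdot \rfloor$ indicates a match on 
$T$, $\lfloor \Sigma_n \rfloor$ denotes a match of any segment of characters in alphabet $\Sigma$ whose length $l$ 
satisfies $0 \le l \le n$, and $d_{i,i'}$ is the distance in characters from the last position of word $i$ to the 
beginning of word $i'$. 

By construction, any component query $\mathcal{R}^k$ will match $Q$ in $T$ provided $Q$ does. Their probabilities of 
missing one occurrence due to random errors, $p_k$, are approximately equal to the one that $Q$ has, divided by $b$. 
For the approximate composite query $\mathcal{R}$, however, such probability is notably reduced, being of the order 
of $p_k^b$. 

The expected number of matches that a regular expression of the form of $\mathcal{R}^k$ will find in a random text 
has been analyzed by Flajolet, Szpankowski and Vallee [6]. If $\Omega_Q$ counts the occurrences of pattern $Q$ in 
$T$, the expectation $E[\Omega_Q]$ is approximately 

$$E[\Omega_Q] = l_T \prod_i d_{i,i'} \, P(Q), \qquad (6)$$

with $l_T$ being the length of $T$, $d_{i,i'}$ the subpattern distances, and $P(Q)$ the probability of $Q$. The 
expectation for a composite expression $\mathcal{R}$ is, therefore, approximately $b$ times $E[\Omega_Q]$. 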
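The effect of sectioning on the miss probability can be illustrated with a short numerical sketch. It assumes, beyond what the text states, that the component queries miss independently of each other, so that the composite miss probability is the product of the component ones; the function name `miss_probabilities` and the input values are hypothetical.

```python
def miss_probabilities(p_query, b):
    """Return (component miss probability, composite miss probability).

    p_query is the probability that the full query misses an occurrence.
    Each of the b components is taken to miss with probability p_query / b,
    and the composite misses only when all b components miss (assumed independent).
    """
    p_component = p_query / b
    p_composite = p_component ** b
    return p_component, p_composite


p_component, p_composite = miss_probabilities(0.3, 3)
print(p_component, p_composite)
```

Under these assumptions, a query that would miss roughly three occurrences in ten misses about one in a thousand once it is sectioned into three blocks.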



B. The algorithm 

As it has been implemented, the algorithm distinguishes two particular cases, one for single-word and the other for 
multiple-word queries. Since the number of blocks $b$ cannot be greater than one plus the number of words, and 
since it is $b$ that permits escaping source defects, a word having errors in the segment $q^p \cap q^s$ could not be 
matched. Furthermore, for a word without a clear morphological partitioning, $q^p \cap q^s$ is equal to $q$. This 
case, therefore, is treated separately, by considering that a single word can have at most one error, placed anywhere, 
but extending to no more than two contiguous characters. This is a simple application of the insertion, deletion, 
substitution paradigm. For the sake of completeness this case is also included here. 

The pseudo-codes for the two cases are listed in Algorithms 1 and 2, for the multiple and single word cases, 
respectively. They are implemented in the CB2Bib program in version 0.8.2 [7]. 
Algorithm 1 Approximate composite queries 

for all $q \in Q \mid l_q > 3$ do 
    Split $q$ into $q^p$ and $q^s$, with $q = q^p \cup q^s$ 
    $R \leftarrow R \cup q^p \cup q^s$ 
end for 
for $j = 1$ to $b$ do 
    $\mathcal{R} \leftarrow \mathcal{R} \cup \lfloor r_j \, .\{0, d_{j,j+b}\} \, r_{j+b} \, .\{0, d_{j+b,j+2b}\} \cdots r_{j+\lceil m/b \rceil b-b} \rfloor$ 
end for 
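The interlacing step of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration rather than the CB2Bib implementation: it assumes the word segments are already available as the ordered sequence $r_1 \ldots r_m$, it uses one common gap length instead of the individual distances $d_{i,i'}$, and the function name `composite_regex` is hypothetical.

```python
import re


def composite_regex(segments, b, gap):
    """Build an alternated regular expression from b interlaced blocks of segments.

    segments: ordered list [prefix_1, suffix_1, prefix_2, suffix_2, ...]
    b:        number of interlaced blocks
    gap:      maximum number of characters allowed between consecutive segments
              (a single value here; the algorithm uses per-pair distances)
    """
    joiner = ".{0,%d}" % gap
    components = []
    for k in range(b):
        picked = segments[k::b]  # r_k, r_{k+b}, r_{k+2b}, ...
        components.append(joiner.join(re.escape(s) for s in picked))
    return "(?:" + "|".join(components) + ")"


segments = ["Aproxim", "roximate", "textu", "textual", "retriev", "rieval"]
pattern = composite_regex(segments, b=3, gap=60)
print(pattern)
```

With the segments and parameters of the composite example given in the Examples section, the sketch reproduces the three-component expression listed there.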



Algorithm 2 Approximate single word matching 

if $l_q < 3$ then 
    $\mathcal{R} \leftarrow q$ 
    return 
end if 
for $i = 1$ to $l_q$ do 
    $\mathcal{R} \leftarrow \mathcal{R} \cup \lfloor q[1:i-1] \, .\{0,2\} \, q[i+1:l_q] \rfloor$ 
end for 
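A possible reading of Algorithm 2 is sketched below. The exact slicing is an assumption inferred from the 'chinensis' expression in the Examples section: interior alternatives replace a window of two contiguous characters by a gap of zero to two arbitrary characters, while the first and the last alternatives simply drop the first or the last character. The function name `single_word_regex` is hypothetical, and the alternatives are emitted as a flat alternation rather than with a factored common prefix.

```python
import re


def single_word_regex(word):
    """Build a regular expression tolerating one error of at most two contiguous characters."""
    if len(word) < 3:
        return re.escape(word)
    alternatives = [re.escape(word[1:])]  # first character dropped
    for i in range(1, len(word) - 2):
        # characters i and i+1 (0-based) replaced by a gap of 0 to 2 characters
        alternatives.append(re.escape(word[:i]) + ".{0,2}" + re.escape(word[i + 2:]))
    alternatives.append(re.escape(word[:-1]))  # last character dropped
    return "(?:" + "|".join(alternatives) + ")"


pattern = single_word_regex("chinensis")
print(pattern)
```

The resulting expression matches both 'chinense' and 'chinensis', and also the unrelated string 'machine is', which is the kind of irrelevant occurrence discussed in the examples.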



C. Remarks 

Word partition. The (approximate) partitioning of words into morphological parts is performed using a look-up 
table composed of 1630 prefixes and 1133 suffixes. The listed affixes also include combinations of them. In this 
manner, quant.ize.d, for instance, will show its root quant in the prefix+root portion, as will the related forms 
quant.ization or quant.um. The word quantized is therefore split into quant and quantized. 
This produces longer forms that lower the probability $P(Q)$ in equation 6, and, hence, the chance of unrelated 
occurrences. 
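The look-up partitioning can be sketched as follows. The affix tables below are tiny, hypothetical stand-ins for the actual tables of 1630 prefixes and 1133 suffixes, and the longest-affix rule is an assumption made for the illustration; the function name `split_word` is likewise hypothetical.

```python
# Hypothetical, minimal affix tables for illustration only
PREFIXES = ("re", "un", "pre")
SUFFIXES = ("ization", "ized", "ize", "um", "al")


def split_word(word):
    """Return (prefix+root segment, root+suffix segment) for a word."""
    # Longest listed suffix that ends the word without consuming it entirely
    suffix = max((s for s in SUFFIXES if word.endswith(s) and len(word) > len(s)),
                 key=len, default="")
    # Longest listed prefix that starts the word without consuming it entirely
    prefix = max((p for p in PREFIXES if word.startswith(p) and len(word) > len(p)),
                 key=len, default="")
    prefix_segment = word[:len(word) - len(suffix)]  # prefix + root
    suffix_segment = word[len(prefix):]              # root + suffix
    return prefix_segment, suffix_segment


print(split_word("quantized"))
```

With these tables, quantized, quantization and quantum all share the prefix+root segment quant, while their root+suffix segments remain the full words because none of them carries a listed prefix.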

Interlacing blocks. The number of blocks $b$ expresses the portion of the query used by a composite regular 
expression to scan the sources, being 

$$b = \min[b_{max}, 1 + 100/\mathrm{percentScan}]. \qquad (7)$$

The maximum number of blocks, $b_{max}$, is $\frac{1}{2}m$, or simply, the number of words $n$. 
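Equation 7 can be evaluated with a one-line helper. The sketch assumes integer division for the ratio 100/percentScan, which the text does not specify; the function name `number_of_blocks` is hypothetical.

```python
def number_of_blocks(n_words, percent_scan):
    """Number of interlaced blocks b = min(b_max, 1 + 100 / percentScan), with b_max = n_words."""
    b_max = n_words
    # Integer division is an assumption; the source does not state how the ratio is rounded
    return min(b_max, 1 + 100 // percent_scan)


print(number_of_blocks(3, 50))
```

For a three-word query scanned at 50%, this gives three blocks, which agrees with the three-component expression of the composite example.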

Misses and recall tradeoff. Besides setting the number of interlacing blocks, establishing appropriate gap distances 
$d_{i,i'}$ is relevant regarding the tradeoff between missing occurrences and overwhelming with unrelated ones. 
These two tuning parameters are interrelated, the minimum allowable distances being dependent on the number of 
blocks $b$. High values of $b$, or low percent scanning, greatly reduce the probability of misses, but they increase 
the value of the product of distances $d_{i,i'}$ in equation 6. In the current implementation, and for the examples 
given in this work, the percent scanning has been set to 50%. Distances preceding high frequency words, i.e., 
words with four or fewer characters, are set to three times their minimum allowable value. In the other cases, 
they are set to either twenty times the difference $i' - i$, or three times the allowable minimum, whichever is 
greater. This convention is appropriate for searching a personal collection, where hits are hardly irrelevant, due 
to its reduced and selected nature. 
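The distance convention just described can be written as a small function. The minimum allowable distance is not specified in the text, so it enters the sketch as a hypothetical parameter `d_min`; the function name `gap_distance` and the sample values are also assumptions.

```python
def gap_distance(i, i_next, next_word, d_min):
    """Gap length between segment i and segment i_next of the relabeled sequence.

    next_word: the word that follows the gap
    d_min:     minimum allowable distance (hypothetical parameter, depends on b)
    """
    if len(next_word) <= 4:
        # High frequency (short) word: three times the minimum allowable value
        return 3 * d_min
    # Otherwise the greater of twenty times the index difference and three times the minimum
    return max(20 * (i_next - i), 3 * d_min)


print(gap_distance(1, 4, "textual", 10))
```

With an index difference of three and a small enough minimum, the rule yields a gap of 60 characters, which would be consistent with the gaps appearing in the composite example.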



D. Examples 

Two detailed examples are given to illustrate the algorithm for the cases of single and multiple word queries. The 
queries are performed on the set of articles cited in this work. Bold face font is used to highlight matched string 
segments. 

1. Single word matching 

This example is taken from the work of Wang, Li, Cai, and Chen [3], on approximate string matching in biomedical 
text retrieval. The search for 'chinensis' yields the regular expression: 

(?:c(?:hinensi|hinen.{0,2}s|hine.{0,2}is|hin.{0,2}sis|hi.{0,2}nsis|h.{0,2}ensis|.{0,2}nensis)|hinensis) 

and produces two hits, 

• ...or -icus. Thus the name of 'Bupleurum chinense' is incorrect and the correct name is "Bupleurum chinensis" as shown in Table 1. There are also... 
• ...II Grammatical error 86.6 Bupleurum chinense Bupleurum chinensis Collection II Grammatical error 89.5... 
• ...alba Collection II 29 36 Bupleurum chinense Collection II 23 28 Cinnamomum... 
• ...sachalinense Phellodendron chinense 84.2 Salvia przewalskii Sabina... 

from reference [3], matching the two spellings of the herb, and also, 

• ...tenths of seconds per megabyte. Our machine is a Sun UltraSparc-1 with 167 MHz and... 

from reference [2], and clearly not relevant. 



2. Composite queries 

The search for 'Aproximate textual retrieval' (note the typo) gives the word segments Aproxim, roximate, textu, 
textual, retriev, and rieval, and the three-component regular expression: 

Aproxim.{0,60}textual 
roximate.{0,60}retriev 
textu.{0,60}rieval 

alternated as 

(?:Aproxim.{0,60}textual|roximate.{0,60}retriev|textu.{0,60}rieval) 

It retrieves the following texts, 

• ...the results above show that, for approximate matching, they have speed and retrieval effectiveness similar to that of 3... 

• ...references about the relation of approximate string matching and information retrieval are Wagner and Fisher [1974... 
• ...2000. Block addressing indices for approximate text retrieval. J. Am. Soc. Inf. Sci. (JASIS) 51... 
• ...SCHULMAN, E. 1997. Applications of approximate word matching in information retrieval. In Proceedings of the 6th ACM... 

• ...Assessment of approximate string matching in a biomedical text retrieval problem J.F. Wang, a, Z.R. Lia,b , C... 

• ...Keywords Fuzzy matching, approximate information retrieval, fault-tolerant fulltext search, q... 
• ...metric, used by most available approximate text retrieval algorithms, is not appropriate when... 

from the references [1], [2], [3], and [4], respectively. 

Note that each of the three words in the query appears in two, and only two, component expressions. Therefore, if 
the segment Approximate textual retrieval had been in the texts, the occurrence would certainly have been retrieved, 
provided that the errors did not extend to more than one of the three words. 
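This tolerance can be checked with a short sketch. It assumes a correctly spelled query whose segments are Approxim, roximate, textu, textual, retriev, and rieval (a hypothetical variant of the example above), and corrupts one word of the text at a time; the names `COMPOSITE` and `is_retrieved` are illustrative only.

```python
import re

# Composite expression for the correctly spelled query (hypothetical variant of the example)
COMPOSITE = re.compile(
    r"(?:Approxim.{0,60}textual|roximate.{0,60}retriev|textu.{0,60}rieval)"
)


def is_retrieved(text):
    """True if at least one component expression matches the text."""
    return COMPOSITE.search(text) is not None


# One corrupted word at a time: some component always survives
for sample in ("Appr0ximate textual retrieval",
               "Approximate t3xtual retrieval",
               "Approximate textual retr1eval"):
    print(sample, is_retrieved(sample))

# Errors in two words defeat all three components
print(is_retrieved("Appr0ximate t3xtual retrieval"))
```

Each corrupted word invalidates at most the two components it takes part in, so the third one still matches; once two words are damaged, no component is left intact.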